Detail-preserving and Content-aware Variational
Multi-view Stereo Reconstruction

Zhaoxin Li, Kuanquan Wang, Wangmeng Zuo, Deyu Meng and Lei Zhang 


Abstract —Accurate recovery of 3D geometrical surfaces from 
calibrated 2D multi-view images is a fundamental yet active 
research area in computer vision. Despite the steady progress in 
multi-view stereo reconstruction, most existing methods are still 
limited in recovering fine-scale details and sharp features while
suppressing noise, and may fail in reconstructing regions with
few textures. To address these limitations, this paper presents a
Detail-preserving and Content-aware Variational (DCV) multi-view
stereo method, which reconstructs the 3D surface by
alternating between reprojection error minimization and mesh
denoising. In reprojection error minimization, we propose a novel
inter-image similarity measure, which effectively preserves
fine-scale details of the reconstructed surface and builds a
connection between guided image filtering and image registration.
In mesh denoising, we propose a content-aware $\ell_p$-minimization
algorithm that adaptively estimates the $p$ value and regularization
parameters based on the current input. It is considerably more
effective at suppressing noise while preserving sharp features
than conventional isotropic mesh smoothing. Experimental results
on benchmark datasets demonstrate that our DCV method is 
capable of recovering more surface details, and obtains cleaner 
and more accurate reconstructions than state-of-the-art methods. 
In particular, our method achieves the best results among all 
published methods on the Middlebury dino ring and dino sparse 
ring datasets in terms of both completeness and accuracy. 

Index Terms —Multi-view stereo, reprojection error,
feature-preserving, $\ell_p$ minimization, mesh denoising.


I. INTRODUCTION 

MULTI-VIEW stereo (MVS), which aims at inferring a
scene's 3D geometric surface from a set of calibrated
2D images captured from different views, is a fundamental
problem in computer vision. Due to its capability of high-quality
reconstruction for both indoor and outdoor scenes, MVS has
been widely used in science and engineering [10]-[12]. Driven
by the MVS benchmark datasets in [1] and [2], various
MVS algorithms have been proposed to gradually improve
the accuracy and completeness of MVS reconstruction
[43], [62], and MVS remains an active research area
that attracts considerable attention.

The performance of existing MVS methods is limited by
factors such as violation of the Lambertian reflectance
model, inaccurate camera calibration, lack of textures on the
object, and false matches. Noise is therefore inevitable
in the reconstructed 3D surface, resulting in degraded accuracy
and visually unpleasant artifacts. A number of methods,
e.g., weighted minimal surface models [13], [14], have been
proposed to suppress noise. However, methods of this kind
usually impose an isotropic smoothness prior on 3D models, and
tend to over-smooth fine-scale details and sharp features.

To overcome these limitations, various methods have been
developed to suppress noise while preserving sharp features.
Based on the 3D model representation, these methods can
be grouped into three categories, i.e., point cloud-based,
volumetric-based, and mesh-based. For point cloud-based
methods, a smoothness prior is introduced in [58] to improve the
accuracy of local matches for each stereo pair. In [43], [59],
accurate point clouds in highly textured regions are generated
by deploying reliable features, and then propagated to the
neighbouring regions. Besides, several heuristic strategies [60],
[61] are suggested to evaluate the reliability of each point and
remove noise based on local geometry orientation, photometric
consistency, and visibility. For these methods [43], [58]-[61],
meshing of the point clouds is usually required to generate the
final 3D surface, which may lead to over-smoothing of thinly
protruding structures. Besides, noise and missing data in the
point clouds can be propagated to the meshing step, resulting in
artifacts in the final reconstruction.

In volumetric-based methods, a photoflux term of
photoconsistency [25] is introduced to provide a data-driven
ballooning force toward the maximal photo-consistent surface. Such
an energy term is helpful in segmenting thin structures, but
fails to recover the structures in concave regions. Kolev et
al. [4] added a stereo regional term to enforce the background
constraint based on a set of depth maps. The regional term
can be updated along with the iterations to infer the occluded
regions, making the method work well in recovering both
protruding structures and concave regions. Kostrikov et al. [24]
further improved the method in [4] by proposing a robust
camera selection algorithm for labelling voxels as interior or
exterior. However, the high memory requirements of
volumetric-based methods hamper their application in large-volume
and high-quality MVS reconstruction.

For mesh-based MVS, a number of variational methods
[19], [29]-[32] have been proposed to improve the
reconstruction quality. They can also be employed as a
refinement step of other methods for high-quality reconstruction
[31]-[33], [43]. However, most existing mesh-based methods
adopt isotropic mesh smoothing, and compute the photoconsistency
by the zero-mean normalized cross-correlation
(ZNCC). As a result, they often fail to recover the fine
details and sharp features of the object surface.

In addition to the above methods, other cues, e.g.,
silhouettes [9], [19]-[23] and surface orientation [26], can also
be incorporated to help 3D reconstruction. The silhouettes
of an object can be fused to ensure that the reconstructed
surface preserves the protrusions and indentations. They can
be adopted in either volumetric-based [21]-[23] or mesh-based
methods [9], [19], [20]. However, the incorporation
of silhouettes cannot guarantee the preservation of fine-scale
surface details and sharp features that are not on the contour
generators of the surface. The surface orientation of the observed
shape can be employed to design an anisotropic weighted surface
[26]. However, the computation of surface orientation needs
accurate second-order surface derivatives, and the constant
albedo assumption may not hold [27], [28], so surface
orientation is applicable only to some restricted scenarios.

Compared with point cloud-based and volumetric-based
methods, mesh-based methods are better suited to reconstructing
high-resolution surfaces with low memory requirements,
but their accuracy is generally limited, partially due to
the use of isotropic mesh smoothing and the ZNCC-based
inter-image similarity measure. To address these issues, in this
paper we propose a novel inter-image similarity measure and a
content-aware mesh denoising algorithm, resulting in a
detail-preserving and content-aware variational (DCV) method
for MVS. As shown in Fig. 1, the
contribution of this work is two-fold:

• An inter-image similarity measure is proposed to preserve
fine-scale details of the reconstructed surface. The
proposed similarity measure also builds a connection
between guided image filtering [34] and image registration,
which gives the measure its edge-preserving behaviour.

• A content-aware $\ell_p$-minimization algorithm is proposed
for mesh denoising. By adaptively estimating a suitable $p$
value and regularization parameters, our algorithm smooths
the mesh while preserving sharp features.


Extensive experimental results on benchmark datasets validate
the superiority of our DCV method in accurate 3D reconstruction.
Moreover, DCV achieves the best results among all published
methods on the Middlebury dino ring and dino sparse
ring datasets in terms of both completeness and accuracy.

The paper is organized as follows. Section II introduces
the related work. Section III briefly introduces the concept of
reprojection error and its minimization. Section IV presents
the pipeline of our method and its two major components, i.e.,
the detail-preserving similarity measure and content-aware mesh
denoising. Section V presents the experimental
results. Finally, the paper is concluded in Section VI.


II. Related Work 

This section gives a brief survey of mesh-based MVS
methods according to the two key elements that determine the
quality of MVS reconstruction: data fidelity and regularization.
Mesh-based MVS methods can provide high-resolution
reconstruction with low memory requirements, and they are
convenient to accelerate by using graphics hardware. Due to
these advantages, the mesh-based representation has been widely
adopted in state-of-the-art MVS methods for 3D reconstruction
of indoor [5], [19] and outdoor scenarios [31], [32], and for
surface refinement [31], [32], [43].

Data fidelity. Data fidelity is used to measure photometric
consistency between images. In some early works [19],
data fidelity is measured by comparing projections of
surface points (or a planar patch tangent to the surface point) with
the corresponding neighbouring images, i.e., photoconsistency.
The total consistency of the mesh surface is a summation of
photoconsistency over all the mesh vertices. The main drawback
of this measure is the projective distortion that occurs in
the high-curvature regions of objects. Another line of methods
compares the image pixels with rendered surface textures
[28] by implicitly assuming controlled lighting environments.
The reprojection error minimization framework, which is also
known as the reprojection error functional [5], [30], [35], [36],
attempts to solve the MVS problem by comparing the observed
and predicted values of pixels generated from the reconstructed
surface. The total consistency of the mesh surface is measured
in image space instead of 3D surface space to alleviate
projective distortion. It can also be considered as an image
registration problem [37], i.e., registering the input images and
their predicted images.

A similarity measure is needed to measure the reprojection
error. In previous works, differentiable and isotropic
similarity measures have been widely used, such as Zero-mean
Normalized Cross Correlation (ZNCC) [29], [31], [32],
[37] and Sum of Squared Differences (SSD) [30], [35], [36].
However, these isotropic similarity measures tend to flatten
or smooth the sharp features of the surface, and are limited in
recovering fine-scale details. Edge-aware anisotropic methods
can be used to replace the isotropic ones. Indeed, anisotropic
methods have been independently proposed in binocular stereo
vision [15]-[18]. Among them, guided image filtering has
been used [16] due to its effectiveness and efficiency.
However, the guided image filter in stereo vision is employed
to filter discrete disparity space images (DSI), and cannot be
directly adopted in the reprojection error minimization
framework, where a variational measure is necessary. The proposed
method fills the gap between the guided image filtering-based
anisotropic measure and variational image registration,
and it is effective in reconstructing fine-scale details.

Surface regularization. Mesh regularization methods are
introduced to improve the smoothness while preserving details
of the 3D surface, and can be divided into two categories,
i.e., surface smoothing and denoising. For surface smoothing,
several mesh smoothing operators, e.g., the discrete
Laplace-Beltrami operator [38], have been adopted as band-pass
filters in MVS. Other smoothing methods, such as mean curvature
motion [35], [36] and gradient flow [6], have been studied
and applied to mesh-based MVS [30]. To improve the
computational efficiency, different approximations, such as the
Laplacian approximation [20], [29], the umbrella operator
[19] and the paraboloid approximation [28], have been proposed.
Higher-order derivatives, e.g., a combination of the first- and
second-order Laplacian [43] and the thin-plate energy [31], [32],
have been suggested to handle artificial shrinkage of small
components and to penalize strong bending. However, higher-order
derivatives are not well defined at regions with sharp features
and their computation is sensitive to noise.

Fig. 1: Overview of the proposed DCV method.

Mesh denoising aims to remove noise or spurious
details while preserving sharp edge and corner features, and
can be further classified into three sub-categories. The first one
is based on bilateral filtering on vertices [39]; the second one
combines normal filtering and vertex position updating [40],
[41]; and the third one is based on optimizing an $\ell_0$ norm based
non-convex energy function [42]. In this work, we propose
a novel mesh denoising method by considering the gradient
distribution of surface meshes, which is formulated as an
$\ell_p$-minimization problem in the maximum a posteriori (MAP)
framework. Moreover, we adaptively select the $p$ value and
regularization parameters, making our method content-aware
to preserve sharp features.



Fig. 2: Illustration of visibility of the surface. (a) Visible parts of
the surface for each camera: $S_i$ for camera $i$ (center $C_i$) and $S_j$
for camera $j$ (center $C_j$). $S_{ij}$ is the shared visible part for
cameras $i$ and $j$. (b) The interior points are those surface points
that are visible from both cameras of a stereo pair and are not on the
contour generators of the surface. The horizons are the points on the
contour generators for a specific camera. The terminators are occluded
by horizons and lie behind the horizons along the camera ray.


III. Prerequisites: Reprojection Error and Its Minimization 


Let $S \subset \mathbb{R}^3$ denote the reconstructed surface of an object,
$B \subset \mathbb{R}^3$ stand for its background, and
$I_i : \Omega_i \subset \mathbb{R}^2 \to \mathbb{R}^d$
denote the observed (input) image in camera $i$ ($d = 1$ for
grayscale images, and $d = 3$ for color images). In image
formation, the observed image only records the visible (i.e.,
unoccluded) part of a real scene, which includes both the
foreground object of interest and the irrelevant background.
As shown in Fig. 2(a), $S_i$ is the visible part of the surface for
camera $i$. We define $S_{ij}$ as the shared visible surface of
camera $i$ and camera $j$. Let $\pi_i : \mathbb{R}^3 \to \Omega_i$ be the
perspective projection which projects a 3D point $\mathbf{x}$ to a 2D
pixel $\mathbf{p}$. Let $I_{i,S,B}$ be the predicted image of $I_i$ via
surface and background; $I_{i,S}$ is the predicted image for the object
part and $I_{i,B}$ is the predicted image for the background part. With
the desired 3D reconstruction of object $S$ and background $B$, it is
natural to assume that the image predicted by the 3D object and
background models should be similar to the observed image. Therefore,
the minimization framework of reprojection errors adopts the
following functional [5], [30], [35], [36]:


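$$
\begin{aligned}
E_{im}(S) &= \sum_i \int_{\Omega_i} g\big(I_i(\mathbf{p}), I_{i,S,B}(\mathbf{p})\big)\,d\mathbf{p} \\
&= \sum_i \left[ \int_{\pi_i \circ S_i} g_F^i\big(I_i(\mathbf{p}), I_{i,S}(\mathbf{p})\big)\,d\mathbf{p}
 + \int_{\Omega_i - \pi_i \circ S_i} g_B^i\big(I_i(\mathbf{p}), I_{i,B}(\mathbf{p})\big)\,d\mathbf{p} \right],
\end{aligned} \tag{1}
$$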

where $\pi_i \circ S_i$ denotes the projection of surface $S_i$ onto $I_i$, and
the reprojection error $g(I, \tilde{I})(\mathbf{p})$ denotes the similarity
measure between images $I$ and $\tilde{I}$ at pixel $\mathbf{p}$. The
reprojection error $g_F^i(I_i(\mathbf{p}), I_{i,S}(\mathbf{p}))$ measures the
similarity between image $I_i$ and its predicted image via the surface of
the object, and the reprojection error
$g_B^i(I_i(\mathbf{p}), I_{i,B}(\mathbf{p}))$ measures the similarity between
image $I_i$ and the predicted background image.

The predicted image can be generated by rendering the surface
and background. In particular, $I_{i,S}$ is defined based on stereo
pairs and is usually not a single image. One of the predicted
images of $I_i$ can be computed by first projecting its neighbouring
image $I_j$ onto the reconstructed surface $S$ and then
projecting it to the image space of camera $i$, which defines
a predicted image $I_{ij,S}$. The valid definition domain for $I_{ij,S}$
is the projection of the shared visible surface $S_{ij}$, i.e.,
$\pi_i \circ S_{ij}$. By counting all the neighbouring images of $I_i$,
$g_F^i$ is defined as:

$$ g_F^i(\mathbf{p}) = \sum_j m(I_i, I_{ij,S})(\mathbf{p}) = \sum_j m_{ij}(\mathbf{p}), \tag{2} $$

where $m$ is a similarity measure of two pixels in a small
square window centered on $\mathbf{p}$. The definition domain of the
predicted image $I_{i,B}$ of $I_i$ via background is
$\Omega_i - \pi_i \circ S_i$, i.e., the complement of $\pi_i \circ S_i$.
To simplify the computation, we assume that the background is
uniformly black, which can be implemented by segmenting silhouettes
from the observed images. Based on the fact that
$d\mathbf{p} = -\mathbf{x} \cdot \mathbf{n}(\mathbf{x})/x_z^3\,ds$,
with simple algebra, Eq. (1) can be rewritten as an integral
over the surface by counting only the visible points (see [36]
for details):

$$ E_{im}(S) = -\sum_i \int_S \frac{\mathbf{x} \cdot \mathbf{n}(\mathbf{x})}{x_z^3}\,\big[g_F^i(\mathbf{x}, \mathbf{n}(\mathbf{x})) - g_B^i(\mathbf{x}, \mathbf{n}(\mathbf{x}))\big]\,\lambda_S^i(\mathbf{x})\,ds, \tag{3} $$

where $\mathbf{n}(\mathbf{x})$ is the outward normal of surface $S$ at point
$\mathbf{x}$, and $\lambda_S^i : \mathbb{R}^3 \to [0,1]$ is the visibility
function, which equals 1 if $\mathbf{x}$ is visible from camera $i$ and 0
otherwise.

The functional of reprojection errors in Eq. (3) can be
reformulated on a mesh-based discrete representation. Let us
parametrize surface $S$ as a triangle mesh $M$ with a set of
vertex indexes $\mathcal{V} = \{v_1, v_2, \ldots, v_n\}$ and a set of
triangular faces $\mathcal{F} = \{f_1, f_2, \ldots, f_{|\mathcal{F}|}\}$,
$f_i \in \mathcal{V} \times \mathcal{V} \times \mathcal{V}$. The geometric
embedding of a triangle mesh into $\mathbb{R}^3$ is specified by associating
each vertex with a 3D position. Let $\mathbf{x}_i$ denote the position of
vertex $v_i$. Over each triangular face, points are parametrized using
barycentric coordinates $\mathbf{x}(\mathbf{u})$, with
$\mathbf{u} = (u, v) \in T = \{(u, v) \mid u \in [0,1],\ v \in [0, 1-u]\}$.
The energy functional on the triangle mesh is formulated as follows:

$$ E_{im}(M) = \sum_i \sum_k \int_T \mathbf{G}(\mathbf{x}(\mathbf{u})) \cdot \mathbf{N}_k\,\lambda_S^i(\mathbf{x}(\mathbf{u}))\,d\mathbf{u}, \tag{4} $$

where $\mathbf{G}(\mathbf{x}) = -(\mathbf{x}/x_z^3)\,[g_F^i(\mathbf{x}, \mathbf{n}(\mathbf{x})) - g_B^i(\mathbf{x}, \mathbf{n}(\mathbf{x}))]$,
$\mathbf{N}_k$ and $A_k$ are the normal and area of the triangle $f_k$,
respectively, and the term $d\mathbf{u} = 2A_k\,ds$ corresponds to the unit
surface area element in the triangle mesh.

The energy functional in Eq. (4) can be optimized by using
gradient descent over all the vertices of the mesh. According to
[5], [30], [35], [36], the evolution equation for the gradient
descent flow is:

$$
\begin{cases}
\mathbf{x}_k(0) = \mathbf{x}_k^0, \\
d\mathbf{x}_k/dt = -(1/A_k)\,\big[\mathbf{M}_k^{int} + \mathbf{M}_k^{horiz}\big],
\end{cases} \tag{5}
$$

where $\mathbf{M}_k^{int}$ is defined as:

$$ \mathbf{M}_k^{int} = \sum_{j:\, v_k \in f_j} 2A_j\,\mathbf{N}_j \int_T \nabla \cdot \mathbf{G}(\mathbf{x}(\mathbf{u}))\,(1 - u - v)\,du\,dv, \tag{6} $$

and $\mathbf{M}_k^{horiz}$ is defined as:

$$ \mathbf{M}_k^{horiz} = \sum_{\text{horizon edges } \mathbf{H}_{k,j}} \int_{u \in [0,1]} \big[\mathbf{G}(T(\mathbf{y})) - \mathbf{G}(\mathbf{y})\big]\,\frac{\mathbf{y} \wedge \mathbf{H}_{k,j}}{\|\mathbf{y}\|^2}\,(1 - u)\,du, \tag{7} $$

where $\mathbf{V}_k$ is the velocity vector, $\mathbf{H}_{k,j}$ is the vector such that
$[\mathbf{x}_k, \mathbf{x}_k + \mathbf{H}_{k,j}]$ is the edge of the triangular face
$f_j$ generating the horizon, $\mathbf{y}$ is defined as
$\mathbf{y} = \mathbf{x}_k + u\,\mathbf{H}_{k,j}$, and $T(\mathbf{x})$ is the
terminator of $\mathbf{x}$. The definitions of horizon and terminator
are illustrated in Fig. 2(b). $\mathbf{M}_k^{int}$ is the gradient for a vertex
(interior point) that does not change its visibility state, and
$\mathbf{M}_k^{horiz}$ is the gradient for a vertex that exhibits strong
changes in visibility during the evolution.

The term $\mathbf{M}_k^{horiz}$ is used to confine the horizons of the
surface in different cameras. Although its influence is
considerably decreased by the introduction of surface
regularization, this term is very useful for preserving thin protruding
structures on the border between object and background. This
naturally corresponds to a silhouette constraint [19]-[23]. The
form of $g_B^i$ decides the consistency of the reconstructed model
with the silhouettes, and SSD can be used to measure this error.

The term $\mathbf{M}_k^{int}$ is crucial to the reconstruction quality. To
evolve the current surface $S$, we should estimate the derivative
of $g_F^i(\mathbf{x})$. As shown in [37], the gradient of the similarity
measure $m_{ij}$ with respect to an infinitesimal vector displacement
$\delta S$ of the 3D surface point $\mathbf{x}$ can be computed using the
chain rule:

$$ \lim_{\epsilon \to 0} \frac{\partial m_{ij}(S + \epsilon\,\delta S)}{\partial \epsilon}
 = \lim_{\epsilon \to 0} \int_{\pi_i \circ S_{ij}} \frac{\partial m(I_i, I_{ij,S})}{\partial I_{ij,S}(\mathbf{p}_j)}\,\frac{\partial I_{ij,S}}{\partial \mathbf{p}_j}\,\frac{\partial \mathbf{p}_j}{\partial \mathbf{x}}\,\frac{\partial \pi_{i,S}^{-1}}{\partial \epsilon}\,d\mathbf{p}_i, \tag{8} $$

and we have

$$ \nabla g_F^i(S)(\mathbf{x}) = -\sum_{j,\, i \neq j} \eta_{\pi_i \circ S_{ij}}\,\frac{\partial m(I_i, I_{ij,S})}{\partial I_{ij,S}(\mathbf{p}_j)}\,\frac{\partial I_{ij,S}}{\partial \mathbf{p}_j}\,\frac{\partial \mathbf{p}_j}{\partial \mathbf{x}}\,\frac{\mathbf{d}_i}{x_z^3}\,\mathbf{n}, \tag{9} $$

where $\mathbf{p}_i$ and $\mathbf{p}_j$ are the pixel positions in images $I_i$ and
$I_{ij,S}$, respectively, $\pi_{i,S}^{-1} : \Omega_i \to \mathbb{R}^3$ is the
inverse projection which projects pixels from camera $i$ onto the surface,
$\mathbf{d}_i$ is the vector joining the center of camera $i$ and $\mathbf{x}$,
and $\eta$ is the Kronecker symbol which cancels the gradient computation in
the region outside the shared visible surface of both cameras.
When the surface moves, the predicted image changes accordingly.
Hence, the variation of reprojection errors involves
the derivative of the similarity measure with respect to its
second argument, i.e., $\partial_2 m = \partial m/\partial I_{ij,S}$, as shown
in the first derivative term on the right side of Eq. (9). Therefore, the
variation of the predicted images strongly affects the 3D shape
of the surface.


IV. Proposed Method

Following the mesh-based MVS framework, our DCV
model consists of two terms, i.e., data fidelity $E_{im}$ and surface
regularization $E_{reg}$. The energy functional of our model can
be formulated as:

$$ E(S) = E_{im}(S) + \lambda E_{reg}(S), \tag{10} $$

where $S$ denotes the reconstructed surface of the object,
and $\lambda$ is the trade-off parameter. Note that $E_{im}$ is usually
differentiable while $E_{reg}$ is non-smooth. The model can be
solved by extending the proximal gradient algorithm [63],
which iteratively performs the following two steps.

Step 1. Gradient Descent. Given the current estimate $S^k$,
the gradient descent algorithm is adopted to minimize the data
fidelity term $E_{im}$:

$$ S^{k+0.5} = S^k - \eta\,\partial E_{im}(S^k)/\partial S, \tag{11} $$

where $\eta$ is the stepsize.

Step 2. Surface Denoising. Given $S^{k+0.5}$, the reconstructed
surface $S$ is further refined by solving the following mesh
denoising problem:

$$ S^{k+1} = \arg\min_S \frac{1}{2}\|S - S^{k+0.5}\|^2 + \lambda\eta\,E_{reg}(S). \tag{12} $$

Given a nonsmooth convex function $E_{reg}$ and a smooth
convex function $E_{im}$ whose gradient has Lipschitz constant $L$, when
the stepsize satisfies $\eta < 1/L$ and the surface denoising problem
has a global solution, the algorithm converges to the global
optimum [63]. In our case, even though $E_{reg}$ is nonconvex, our
algorithm empirically converges to a satisfactory solution. In
this work, we propose a detail-preserving similarity measure
for Step 1 and a content-aware mesh denoising
algorithm for Step 2, which are described in detail in
the following two subsections, respectively.
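
The alternation between Step 1 and Step 2 can be illustrated on a toy problem. The sketch below is not the DCV implementation: it assumes a quadratic stand-in for $E_{im}$ and an $\ell_1$ penalty as a convex stand-in for $E_{reg}$ (so that the proximal step of Eq. (12) reduces to soft-thresholding), and the names `prox_grad`, `soft_threshold` and `objective` are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (assumed stand-in for the denoising step, Eq. (12))."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def objective(A, b, lam, s):
    """Toy energy: 0.5 * ||A s - b||^2 + lam * ||s||_1, mirroring Eq. (10)."""
    return 0.5 * np.sum((A @ s - b) ** 2) + lam * np.sum(np.abs(s))

def prox_grad(A, b, lam, n_iter=500):
    """Alternate a gradient step on the smooth term and a proximal step on the penalty."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient of the smooth term
    eta = 0.9 / L                          # stepsize eta < 1/L
    s = np.zeros(A.shape[1])
    for _ in range(n_iter):
        s_half = s - eta * (A.T @ (A @ s - b))    # Step 1, cf. Eq. (11)
        s = soft_threshold(s_half, lam * eta)     # Step 2, cf. Eq. (12) with weight lam * eta
    return s

# Small synthetic problem (all values are illustrative assumptions).
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
s_true = np.zeros(10)
s_true[[1, 4]] = [2.0, -1.5]
b = A @ s_true
s_hat = prox_grad(A, b, lam=0.1)
```

In the toy problem the half-step `s_half` plays the role of $S^{k+0.5}$, and the threshold `lam * eta` mirrors the weight $\lambda\eta$ in Eq. (12); in DCV the proximal step is replaced by the content-aware mesh denoising of Section IV-B.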


A. Detail-preserving Inter-image Similarity Measure 

The similarity measure $m(I_i, I_{ij,S})$ is critical for the
minimization of the reprojection error between $I_i$ and its predicted
image $I_{ij,S}$ in $\pi_i \circ S_{ij}$. In the variational framework, it is
desirable that the similarity measure $m(I_i, I_{ij,S})$ is differentiable.

Among the existing similarity measures [45]-[47], zero-mean
normalized cross correlation (ZNCC) is the most commonly
used one due to the following advantages: (1) it is robust to
inter-image affine illumination variation; and (2) its derivative
can be efficiently computed. However, ZNCC is isotropic: it
treats all pixels equally and tends to flatten the
details of the surface. In this section, we first review the derivative
of the ZNCC-based similarity measure and then propose a
detail-preserving similarity measure based on the principle of guided
image filtering.

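1) Derivative of ZNCC-based Similarity Measure: The
ZNCC measure is defined as follows:

$$ m(I_1, I_2)(\mathbf{p}) = \frac{v_{1,2}(\mathbf{p})}{\sqrt{v_1(\mathbf{p})\,v_2(\mathbf{p})}}, \tag{13} $$

where $v_1$, $v_2$ and $v_{1,2}$ are given by

$$ v_i(\mathbf{p}) = G_\sigma \star I_i^2(\mathbf{p})/\omega(\mathbf{p}) - \mu_i^2(\mathbf{p}) + \epsilon, \tag{14} $$

$$ v_{1,2}(\mathbf{p}) = G_\sigma \star I_1 I_2(\mathbf{p})/\omega(\mathbf{p}) - \mu_1 \mu_2(\mathbf{p}), \tag{15} $$

$$ \mu_i(\mathbf{p}) = G_\sigma \star I_i(\mathbf{p})/\omega(\mathbf{p}), \tag{16} $$

where $G_\sigma$ is a Gaussian kernel with standard deviation $\sigma$,
$\omega$ is a normalization coefficient accounting for the shape of the
support domain, $\omega(\mathbf{p}) = \int_{\pi_i \circ S_{ij}} G_\sigma(\mathbf{p} - \mathbf{q})\,d\mathbf{q}$,
and the small positive constant $\epsilon$ is introduced to prevent the
denominator from being zero. The derivative of $m(I_1, I_2)$ with respect
to any entry of $I_2$ at pixel position $\mathbf{p}$ has the following form [37]:

$$ \partial_2 m(\mathbf{p}) = \alpha(\mathbf{p})\,I_1(\mathbf{p}) + \beta(\mathbf{p})\,I_2(\mathbf{p}) + \gamma(\mathbf{p}), \tag{17} $$

$$ \alpha(\mathbf{p}) = G_\sigma \star \frac{-1}{\omega\sqrt{v_1 v_2}}(\mathbf{p}), \tag{18} $$

$$ \beta(\mathbf{p}) = G_\sigma \star \frac{m}{\omega v_2}(\mathbf{p}), \tag{19} $$

$$ \gamma(\mathbf{p}) = G_\sigma \star \left(\frac{\mu_1}{\omega\sqrt{v_1 v_2}} - \frac{\mu_2\,m}{\omega v_2}\right)(\mathbf{p}). \tag{20} $$

Note that the variation at $\mathbf{p}$ also tends to affect the similarity
measure at its neighbouring positions. Indeed, if we restrict
ZNCC to a local square window of size $w$, the variation of
pixel $\mathbf{p}$ will affect all of its neighbouring pixels in a region
of size $2w \times 2w$.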
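
As a concrete reading of Eqs. (13)-(16), the following NumPy sketch computes a per-pixel ZNCC map with Gaussian-weighted local statistics. It is only an illustration under stated assumptions: the helper names (`gauss_kernel`, `smooth`, `zncc`), the kernel truncation at $3\sigma$, the zero padding at the image border, and the optional binary `mask` standing in for the support domain $\pi_i \circ S_{ij}$ are all choices made here, not details given in the paper.

```python
import numpy as np

def gauss_kernel(sigma):
    """1D Gaussian kernel truncated at 3*sigma and normalized to sum to 1."""
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    return k / k.sum()

def smooth(img, sigma):
    """Separable Gaussian convolution with zero padding (image sides must exceed the kernel length)."""
    k = gauss_kernel(sigma)
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 0, rows)

def zncc(i1, i2, sigma=2.0, eps=1e-6, mask=None):
    """Per-pixel ZNCC map following Eqs. (13)-(16)."""
    mask = np.ones_like(i1, dtype=float) if mask is None else mask.astype(float)
    omega = np.maximum(smooth(mask, sigma), 1e-12)   # normalization for the support domain

    def mean(f):
        return smooth(f * mask, sigma) / omega

    mu1, mu2 = mean(i1), mean(i2)                    # Eq. (16)
    v1 = mean(i1 * i1) - mu1 ** 2 + eps              # Eq. (14)
    v2 = mean(i2 * i2) - mu2 ** 2 + eps
    v12 = mean(i1 * i2) - mu1 * mu2                  # Eq. (15)
    return v12 / np.sqrt(v1 * v2)                    # Eq. (13)

# Illustrative check of the affine illumination invariance mentioned in the text.
rng = np.random.default_rng(0)
img = rng.random((32, 32))
z_affine = zncc(img, 3.0 * img + 5.0)
z_negated = zncc(img, -img)
```

An affine change of intensity (`3.0 * img + 5.0`) leaves the measure close to 1, while a sign flip drives it close to -1, which is the robustness property listed as advantage (1) above.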

2) Detail-preserving Similarity Measure Based on Guided
Image Filtering: Let $I_1$ be the filtering input, and $I_2$ be the
guidance image. The principle of guided image filtering is to
assume a local linear transformation between the filtering output
$Q$ and the guidance image $I_2$ for any pixel $\mathbf{p}$ belonging to a
local window $W_k$ centered at $k$:

$$ Q(\mathbf{p}) = a(\mathbf{p})\,I_2(\mathbf{p}) + b(\mathbf{p}). \tag{21} $$

By minimizing the difference between $Q$ and $I_1$, we can obtain the
parameters $a(\mathbf{p})$ and $b(\mathbf{p})$:

$$ a(\mathbf{p}) = (v_{1,2}/v_2)(\mathbf{p}), \tag{22} $$

$$ b(\mathbf{p}) = (\mu_1 - a\mu_2)(\mathbf{p}) = (\mu_1 - v_{1,2}\,\mu_2/v_2)(\mathbf{p}). \tag{23} $$

Note that the tolerance $\epsilon$ in Eq. (14) can also be included to
penalize large $a(\mathbf{p})$ in (22) and (23). The role of $\epsilon$ in the
guided filter is similar to that of the range variance $\sigma^2$ in the
bilateral filter, which determines the edge patch that should be preserved.
Finally, the filtering output $Q$ has the following form:

$$ Q(\mathbf{p}) = \frac{G_\sigma \star a}{\omega}(\mathbf{p})\,I_2(\mathbf{p}) + \frac{G_\sigma \star b}{\omega}(\mathbf{p}). \tag{24} $$
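
Eqs. (21)-(24) can be sketched in a few lines of NumPy. This is an illustrative reading rather than the authors' code: the function names are hypothetical, the Gaussian window truncated at $3\sigma$ with zero padding is an assumption, and $\epsilon$ is added to $v_2$ as in Eq. (14).

```python
import numpy as np

def smooth(img, sigma):
    """Separable Gaussian convolution with zero padding (kernel truncated at 3*sigma)."""
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    k /= k.sum()
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 0, rows)

def guided_filter(i1, i2, sigma=2.0, eps=1e-3):
    """Filter the input i1 with guidance i2 following Eqs. (21)-(24)."""
    omega = smooth(np.ones_like(i1, dtype=float), sigma)   # window normalization

    def mean(f):
        return smooth(f, sigma) / omega

    mu1, mu2 = mean(i1), mean(i2)
    v2 = mean(i2 * i2) - mu2 ** 2 + eps      # guidance variance with tolerance eps
    v12 = mean(i1 * i2) - mu1 * mu2          # covariance of input and guidance
    a = v12 / v2                             # Eq. (22)
    b = mu1 - a * mu2                        # Eq. (23)
    return mean(a) * i2 + mean(b)            # Eq. (24)

# Illustrative use: a noisy step edge filtered with the clean edge as guidance.
rng = np.random.default_rng(1)
guide = np.zeros((32, 32))
guide[:, 16:] = 1.0
noisy = guide + 0.05 * rng.standard_normal(guide.shape)
filtered = guided_filter(noisy, guide)
flat = guided_filter(np.full((32, 32), 2.0), guide)
```

Where the guidance is flat, $a \approx 0$ and the output falls back to a local mean of the input; near the edge, $a \approx 1$ and the edge of the guidance is transferred to the output, which is the edge-preserving behaviour the section builds on.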


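We can then establish an interesting connection between the
derivatives of reprojection errors and guided image filtering.
Based on (13) and (18)-(20), Eq. (17) can be reformulated as:

$$
\begin{aligned}
\partial_2 m(\mathbf{p}) ={}& \Big(G_\sigma \star \frac{-1}{\omega\sqrt{v_1 v_2}}(\mathbf{p})\Big) I_1(\mathbf{p})
 + \Big(G_\sigma \star \frac{v_{1,2}}{\omega v_2\sqrt{v_1 v_2}}(\mathbf{p})\Big) I_2(\mathbf{p}) \\
&+ G_\sigma \star \Big(\frac{\mu_1}{\omega\sqrt{v_1 v_2}} - \frac{\mu_2 v_{1,2}}{\omega v_2\sqrt{v_1 v_2}}\Big)(\mathbf{p}).
\end{aligned} \tag{25}
$$

Based on (22)-(23), Eq. (24) can be rewritten as

$$ Q(\mathbf{p}) = \Big(G_\sigma \star \frac{v_{1,2}}{\omega v_2}(\mathbf{p})\Big) I_2(\mathbf{p}) + G_\sigma \star \Big(\frac{\mu_1 v_2 - \mu_2 v_{1,2}}{\omega v_2}\Big)(\mathbf{p}). \tag{26} $$

Let $v_1(\mathbf{p}) = v_2(\mathbf{p})$; Eq. (25) becomes:

$$
\begin{aligned}
\partial_2 m(\mathbf{p}) ={}& \Big(G_\sigma \star \frac{-1}{\omega v_2}(\mathbf{p})\Big) I_1(\mathbf{p})
 + \Big(G_\sigma \star \frac{v_{1,2}}{\omega v_2^2}(\mathbf{p})\Big) I_2(\mathbf{p}) \\
&+ G_\sigma \star \Big(\frac{\mu_1 v_2 - \mu_2 v_{1,2}}{\omega v_2^2}\Big)(\mathbf{p}).
\end{aligned} \tag{27}
$$

We can then have:

$$ \partial_2 m(\mathbf{p}) + \Big(G_\sigma \star \frac{1}{\omega v_2}(\mathbf{p})\Big) I_1(\mathbf{p})
 = \Big(G_\sigma \star \frac{v_{1,2}}{\omega v_2^2}(\mathbf{p})\Big) I_2(\mathbf{p})
 + G_\sigma \star \Big(\frac{\mu_1 v_2 - \mu_2 v_{1,2}}{\omega v_2^2}\Big)(\mathbf{p}). \tag{28} $$

Suppose that $v_2(\mathbf{p})$ varies more slowly than $I_2(\mathbf{p})$ and
$I_1(\mathbf{p})$ in the spatial domain, and
$(G_\sigma \star \frac{1}{\omega}(\mathbf{p})) \approx 1$. Then, Eq. (28) can be
approximately rewritten as:

$$ v_2\,\partial_2 m(\mathbf{p}) + I_1(\mathbf{p}) \approx \Big(G_\sigma \star \frac{v_{1,2}}{\omega v_2}(\mathbf{p})\Big) I_2(\mathbf{p})
 + G_\sigma \star \Big(\frac{\mu_1 v_2 - \mu_2 v_{1,2}}{\omega v_2}\Big)(\mathbf{p}). \tag{29} $$

Note that the right sides of Eq. (26) and Eq. (29) are the
same, and the minimization of reprojection error is actually
the maximization of the similarity measure. Therefore, we have

$$ I_1(\mathbf{p}) + v_2\,\partial_2 m(\mathbf{p}) \approx Q(\mathbf{p}), \tag{30} $$

and guided image filtering can be approximately interpreted
as one step of variational image registration of $I_1(\mathbf{p})$ and
$I_2(\mathbf{p})$ with constraint $v_1(\mathbf{p}) = v_2(\mathbf{p})$ and
stepsize $v_2$.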

Motivated by the connection between guided image filtering
and image registration, and to enhance the edge preservation
of the ZNCC-based similarity measure, we modify the derivative
$\partial_2 m$ in Eq. (17) by adding a term to enforce the constraint
$v_1(\mathbf{p}) = v_2(\mathbf{p})$:

$$ \partial_2 m(\mathbf{p}) = \alpha I_1(\mathbf{p}) + \beta I_2(\mathbf{p}) + \gamma(\mathbf{p}) + \kappa\,G_\sigma \star \frac{(v_2 - v_1)}{v_2}(\mathbf{p}), \tag{31} $$

where $\kappa$ is a tradeoff parameter to adjust the influence of the
variance constraint. In practice, we initialize $\kappa$ with a small
value at the beginning of the surface evolution, and gradually
increase it until convergence. By Eq. (31), the predicted image
$I_{ij,S}$ is implicitly set as the guidance image. As shown in Fig. 3,
the proposed similarity measure can recover the fine-scale
details and largely extends the edge preservation capability of
the original isotropic measure.

Fig. 3: Illustration of the proposed similarity measure. (a) One sample
image from the temple sparse ring dataset. (b) The reconstruction
result obtained by the isotropic ZNCC similarity measure. (c) The
result obtained by the proposed detail-preserving similarity measure.
Comparing (c) with (b), one can see that the proposed method
better recovers the fine-scale details. (d)-(f) show another example
from the Buddha dataset.

B. Content-aware Mesh Denoising via $\ell_p$-norm Minimization

Let matrix $X_0 = (\mathbf{x}_{0,i})_{i=1}^n \in \mathbb{R}^{3 \times n}$ store the
positions of the $n$ vertices of the reconstructed noisy surface mesh
$S^{k+0.5}$, and matrix $X = (\mathbf{x}_i)_{i=1}^n \in \mathbb{R}^{3 \times n}$
store the positions of the $n$ vertices of the noise-free surface mesh $S$.
The mesh denoising problem in Eq. (12) can be reformulated as:

$$ \min_X \frac{1}{2}\|X - X_0\|^2 + \lambda R(X). \tag{32} $$

In this section, we first present the proposed content-aware
model based on the MAP framework, and then propose an
alternating minimization algorithm for content-aware mesh
denoising.

1) MAP-based Mesh Denoising with Hyper-Laplacian
Prior: Denote by $q(X)$ the prior on the sharpness of the noise-free
mesh, and by $q(X_0|X)$ the likelihood of the noisy mesh. The MAP
framework estimates $X$ by maximizing the posterior probability
$q(X|X_0) \propto q(X_0|X)\,q(X)$. By assuming that the noise is additive
white Gaussian noise with standard deviation $\sigma$, the likelihood
of the noisy mesh can be modeled as:

$$ q(X_0|X) = (2\pi\sigma^2)^{-n/2} \exp\!\Big(-\frac{\|X - X_0\|^2}{2\sigma^2}\Big). \tag{33} $$



Fig. 5: Shape parameters of different 3D models. From left to right
and from top to bottom: Armadillo ($p = 0.5$, $\theta = 45.63$), Eros
($p = 0.5$, $\theta = 46.62$), Max Planck ($p = 0.2$, $\theta = 11.64$), Duck
($p = 0.5$, $\theta = 17.64$), Buste ($p = 0.75$, $\theta = 34.31$), Raptor
($p = 0.3$, $\theta = 195.3$), Bunny ($p = 0.42$, $\theta = 483.9$), Ramesses
($p = 0.35$, $\theta = 25.96$), Elephant ($p = 0.5$, $\theta = 357.3$), Cervino
($p = 0.35$, $\theta = 5.506$).

For a surface mesh, the edge-based discrete Laplacian operator
$D \in \mathbb{R}^{m \times n}$ proposed in He et al. [42] can be adopted for
computing surface gradients (where $m$ is the number of edges
in the mesh). In image restoration, it has been empirically
verified that natural image gradients generally follow a
heavy-tailed distribution and can be well described by a
hyper-Laplacian [48]. Therefore, we suggest using a hyper-Laplacian
to model surface gradients:

$$ q(X|\theta, p) = \prod_i \frac{p\,\theta^{1/p}}{2\,\Gamma(1/p)} \exp\big(-\theta\,|(DX)_i|^p\big), \tag{34} $$

where $\Gamma$ is the Gamma function, and $p$ and $\theta$ are the shape
parameters: $p \in [0,1]$ determines the peakiness and $\theta$ determines
the width of the hyper-Laplacian distribution.

One concern is whether the surface gradients of real
3D models follow the hyper-Laplacian distribution. Fig. 4
shows the empirical distributions and the corresponding
hyper-Laplacian fits of the surface gradients of three real models.
One can see that the hyper-Laplacian fits the empirical
distributions very well, which supports modeling surface
gradients with a hyper-Laplacian.

It should be noted that the shape parameters $p$ and $\theta$ vary
across different 3D models. To illustrate this, we compute
the shape parameters on more than twenty public models,
including Armadillo and Bunny in the Stanford repository [55]
and models in the AIM@SHAPE repository [56]. We provide
the shape parameters of ten models in Fig. 5. One can see that
the $p$ values vary from 0.2 to 0.75 and the $\theta$ values vary from
5.506 to 483.9. Therefore, instead of fixing the shape parameters,
the $p$ and $\theta$ values should be adaptively estimated for different
3D models. We propose the following content-aware mesh
denoising model that jointly estimates the mesh $X$, the noise level
$\sigma^2$, and the shape parameters $\theta$ and $p$ from the observation
$X_0$:

$$
\begin{aligned}
(X, \sigma^2, \theta, p) &= \arg\min_{X, \sigma^2, \theta, p} \big\{-\log q(X, \sigma^2, \theta, p \mid X_0)\big\} \\
&= \arg\min_{X, \sigma^2, \theta, p} \Big\{ \frac{n}{2}\log(2\pi\sigma^2) + \frac{\|X - X_0\|^2}{2\sigma^2} + \theta\|DX\|_p^p \\
&\qquad\qquad + 3m\Big[\log\Gamma\big(\tfrac{1}{p}\big) - \log\big(\tfrac{p}{2}\big) - \tfrac{1}{p}\log\theta\Big] \Big\}.
\end{aligned} \tag{35}
$$

2) Alternating Minimization: We propose an alternating
minimization algorithm to minimize Eq. (35) for the joint
estimation of the noise-free (denoised) mesh $X$, the noise standard
deviation $\sigma$, and the hyper-Laplacian parameters $\theta$ and $p$ by
iteratively solving the following two subproblems.

Fig. 4: Surface gradient distributions of three real 3D models: (a) the
Red circular box model, (b) the Hand Olivier model, and (c) the Gargo
model. (d)-(f) are their empirical distributions of sharpness (red) and
the corresponding fitted hyper-Laplacian profiles (blue).

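(1) Given $X$, the optimization problem w.r.t. $\sigma$, $\theta$, and $p$ can
be reformulated as

$$\sigma = \arg\min_{\sigma} \left\{ \frac{n}{2}\log(\sigma^2) + \frac{\|X - X_0\|^2}{2\sigma^2} \right\}, \tag{36}$$

$$\theta, p = \arg\min_{\theta,p} \left\{ \frac{\theta}{2}\|DX\|_p^p + 3m\left[\log\Gamma\!\left(\frac{1}{p}\right) - \log\frac{p}{2} - \frac{1}{p}\log\frac{\theta}{2}\right] \right\}. \tag{37}$$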

The $\sigma$-subproblem has the closed-form solution $\sigma^2 = \|X_0 - X\|^2 / n$.
The problem in Eq. (37) can be solved by: (a) finding
the optimal $\theta$ for given $p$, which results in the closed-form
solution $\theta = 3m / (p\,\|DX\|_p^p)$, and (b) using a simple 1D exhaustive
search to estimate $p$ for the given $\theta$.
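
The parameter update can be sketched on a flat list of gradient components as below. This is a minimal illustration under stated assumptions rather than the authors' implementation: the density is parametrised as $\propto \exp(-a|x|^p)$ with a generic scale `a` (how `a` maps to $\theta$ depends on how Eq. (34) is normalised), `N = len(grads)` plays the role of the component count, the grid of candidate `p` values is arbitrary, and steps (a) and (b) are folded into one pass that sets the scale in closed form for every candidate `p` and keeps the pair with the lowest negative log-likelihood. All function names are hypothetical.

```python
import math

def noise_variance(x0, x):
    """Closed-form sigma^2 = ||x0 - x||^2 / n for flattened coordinate lists."""
    return sum((u - v) ** 2 for u, v in zip(x0, x)) / len(x0)

def scale_for_p(grads, p):
    """Minimiser over a of a*S - (N/p)*log(a), where S = sum |g|^p (assumes S > 0)."""
    s = sum(abs(g) ** p for g in grads)
    return len(grads) / (p * s)

def neg_log_likelihood(grads, a, p):
    """-log of prod_i p*a**(1/p)/(2*Gamma(1/p)) * exp(-a*|g_i|**p)."""
    n = len(grads)
    s = sum(abs(g) ** p for g in grads)
    return a * s + n * (math.lgamma(1.0 / p) - math.log(p / 2.0) - math.log(a) / p)

def fit_shape(grads, p_grid):
    """Exhaustive 1D search over p; the scale is set in closed form for each p."""
    best = None
    for p in p_grid:
        a = scale_for_p(grads, p)
        nll = neg_log_likelihood(grads, a, p)
        if best is None or nll < best[0]:
            best = (nll, a, p)
    return best[2], best[1]

# Deterministic Laplacian-like test data (p = 1) built from exponential quantiles.
N = 2000
mags = [-math.log(1.0 - (i + 0.5) / N) for i in range(N)]
grads = [m if i % 2 == 0 else -m for i, m in enumerate(mags)]
p_grid = [0.05 * k for k in range(2, 21)]      # 0.10, 0.15, ..., 1.00
p_hat, a_hat = fit_shape(grads, p_grid)
```

Because the search is one-dimensional, its cost is the grid size times one pass over the gradients, which is small next to the mesh update itself.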

(2) Given $\sigma$, $\theta$ and $p$, we define $\lambda = \theta\sigma^2 / 2$, and then we
have

$$X = \arg\min_{X} \frac{1}{2}\|X_0 - X\|^2 + \lambda\|DX\|_p^p, \tag{38}$$

where $\lambda$ is the regularization parameter. Using the variable
splitting approach, Eq. (38) can be reformulated as:

$$X = \arg\min_{X,\psi} \frac{1}{2}\|X_0 - X\|^2 + \lambda\|\psi\|_p^p + \beta\|DX - \psi\|^2, \tag{39}$$

which can also be optimized with an alternating optimization
method. With $\psi$ fixed, $X$ can be optimized by solving the following
quadratic problem:

$$X = \arg\min_{X} \frac{1}{2}\|X_0 - X\|^2 + \beta\|DX - \psi\|^2. \tag{40}$$

With $X$ fixed, $\psi$ can be optimized by solving the following
subproblem:

$$\psi = \arg\min_{\psi} \lambda\|\psi\|_p^p + \beta\|DX - \psi\|^2. \tag{41}$$

The above subproblem can be efficiently solved by using
the generalized shrinkage/thresholding (GST) method [49].
The solution for each $(\psi)_i$ can be written as:

$$(\psi)_i = T_p^{GST}\!\left((DX)_i;\ \frac{\lambda}{2\beta}\right), \tag{42}$$

where $T_p^{GST}$ is the generalized shrinkage/thresholding operator
[49]. When the penalty factor $\beta \to \infty$, the solution to Eq.
(39) converges to that of Eq. (38). In practice, we adopt the
continuation technique by initializing $\beta$ with a small value and
gradually increasing it until convergence.
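
To make the splitting scheme concrete, the sketch below applies it to a 1D signal, with $D$ taken as the forward-difference operator instead of a mesh gradient operator. It is a toy under stated assumptions, not the paper's solver: `gst` follows the thresholding rule and fixed-point iteration commonly given for the GST operator (solving $\min_x \frac{1}{2}(x-y)^2 + \lambda|x|^p$), a single $\psi$-step and $X$-step are run per value of $\beta$, and the doubling schedule, iteration count and function names are arbitrary choices made here.

```python
import math

def gst(y, lam, p, iters=20):
    """Approximate argmin_x 0.5*(x - y)**2 + lam*|x|**p for 0 <= p <= 1."""
    if lam <= 0.0:
        return y
    base = 2.0 * lam * (1.0 - p)
    tau = base ** (1.0 / (2.0 - p)) + lam * p * base ** ((p - 1.0) / (2.0 - p))
    mag = abs(y)
    if mag <= tau:
        return 0.0                       # below the threshold: shrink to zero
    x = mag
    for _ in range(iters):               # fixed-point iteration on the magnitude
        x = mag - lam * p * x ** (p - 1.0)
    return math.copysign(x, y)

def solve_tridiagonal(off, diag, rhs):
    """Thomas algorithm for a tridiagonal system with constant off-diagonal `off`."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = off / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - off * c[i - 1]
        c[i] = off / denom
        d[i] = (rhs[i] - off * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def lp_denoise_1d(x0, lam, p, beta=1.0, beta_max=1024.0, growth=2.0):
    """Continuation over beta with one psi-step and one X-step per value."""
    n = len(x0)
    x = list(x0)
    while beta <= beta_max:
        # psi-step: element-wise GST of the forward differences, threshold lam/(2*beta)
        psi = [gst(x[i + 1] - x[i], lam / (2.0 * beta), p) for i in range(n - 1)]
        # X-step: solve (I + 2*beta*D^T D) x = x0 + 2*beta*D^T psi
        diag = [1.0 + 2.0 * beta * (1.0 if i in (0, n - 1) else 2.0) for i in range(n)]
        rhs = []
        for i in range(n):
            left = psi[i - 1] if i > 0 else 0.0
            right = psi[i] if i < n - 1 else 0.0
            rhs.append(x0[i] + 2.0 * beta * (left - right))
        x = solve_tridiagonal(-2.0 * beta, diag, rhs)
        beta *= growth
    return x

# A noisy step edge: the plateaus should flatten while the jump survives.
noisy = [(0.0 if i < 5 else 1.0) + (0.02 if i % 2 == 0 else -0.02) for i in range(10)]
clean = lp_denoise_1d(noisy, lam=0.02, p=0.5)
```

For `p = 1` the operator reduces to soft thresholding, which gives a simple way to check an implementation before moving to `p < 1`.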


V. Experiments 

We implemented the proposed DCV method in C++
with the OpenGL, CGAL and TAUCS libraries. The predicted
images are estimated using projective texture mapping, and
OpenGL is adopted to generate the horizon and terminator of the
triangular surface. CGAL is used to manipulate the triangular
mesh and TAUCS is used to manipulate the sparse matrices.
We quantitatively and qualitatively evaluate the performance
of DCV on multiple datasets, including the Middlebury benchmark
and several public datasets with indoor and outdoor
scenes. These public datasets have camera calibration
parameters available. We also provide three real datasets, i.e.,
Buddha, Totoro and bell, taken with a mobile phone or a digital
camera. For these three datasets, the cameras are calibrated using
the Bundler software [52]. In all experiments, the window size
for calculating the similarity measure is set to 7 × 7 pixels.

A. Initialization and Implementation Details 

In our experiments, we consider two initialization methods:
visual hull and PMVS+PSR (Poisson Surface Reconstruction).
The visual hull is the intersection of the visual cones associated
with all image silhouettes, and can provide a good initialization
for most indoor scenes where the object of interest is easy
to segment from the background. PMVS is an open
source software designed by Furukawa and Ponce [43]. A set
of dense patches is generated by PMVS with its default
parameters, and then a triangular surface mesh is estimated by
using PSR with the octree depth fixed to 8. PMVS+PSR is
mainly used to initialize scenes where the background is
not easy to segment from the foreground, including some
outdoor and indoor scenes. For the temple dataset, because
some small protruding structures tend to be over-smoothed
when the large concave region in the back of the temple is recovered,
we also use PMVS+PSR to generate the initial mesh. The
statistics of all the datasets used in our experiments are listed
in Table I, including the number of images, image resolution,
initialization and running times (CPU i7, 2.4 GHz).
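
The visual hull idea can be illustrated with a toy voxel-carving sketch: a voxel is kept only if its projection falls inside every silhouette. Everything below is hypothetical and simplified for illustration: the two orthographic "cameras", the silhouettes stored as sets of pixel coordinates, and the function names. A real pipeline would use the calibrated projection matrices and segmented silhouette images.

```python
from itertools import product

def carve(voxels, views):
    """Keep the voxels whose projection lies inside every silhouette."""
    return [v for v in voxels
            if all(project(v) in silhouette for project, silhouette in views)]

def look_down_z(v):
    """Toy orthographic projection along the z axis: (x, y, z) -> (x, y)."""
    return (v[0], v[1])

def look_down_x(v):
    """Toy orthographic projection along the x axis: (x, y, z) -> (y, z)."""
    return (v[1], v[2])

grid = list(product(range(2), repeat=3))                 # 2 x 2 x 2 voxel grid
views = [
    (look_down_z, {(0, 0), (0, 1), (1, 0), (1, 1)}),     # full square silhouette
    (look_down_x, {(0, 0), (1, 0)}),                     # only the z = 0 layer is seen
]
hull = carve(grid, views)
```

The surviving voxels form the intersection of the back-projected silhouettes, which is why the hull cannot recover concavities that no silhouette reveals.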

Two issues, non-convexity and topology adaptivity, are
considered in our implementation. Since the $L_p$-sparsity ($0 <
p < 1$) is used, the objective functional of DCV is non-convex,
making the algorithm prone to local minima. To
alleviate this, we adopt a multi-resolution scheme. We first
minimize the energy on a low-resolution mesh with accordingly
downsampled images, and then optimize it on the high-resolution
mesh and full-size images. The Gaussian pyramid is used to
downsample the images. The QSlim algorithm [53] is used to
simplify the mesh, and the √3-subdivision scheme [54] is used to
subdivide the mesh to a higher resolution. Another issue is the
topology adaptivity of mesh-based methods, which need an
initial surface of the model with an approximately consistent
topology. We use two initialization methods, visual hull and
PMVS+PSR, in the experiments. Other initialization methods
can also be deployed, e.g., those based on features,
fusion of depth maps, and volumetric optimization.
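
The image side of the multi-resolution scheme can be sketched as a small Gaussian pyramid. The code below is an illustrative assumption, not the implementation used in the experiments: it uses a 5-tap binomial kernel as the Gaussian approximation, clamps at the image border, works on grayscale images stored as nested lists, and all names are hypothetical.

```python
KERNEL = (1.0 / 16, 4.0 / 16, 6.0 / 16, 4.0 / 16, 1.0 / 16)   # binomial ~ Gaussian

def blur_rows(img):
    """Convolve every row with KERNEL, clamping indices at the border."""
    height, width = len(img), len(img[0])
    out = []
    for r in range(height):
        row = []
        for c in range(width):
            acc = 0.0
            for k, weight in enumerate(KERNEL):
                cc = min(max(c + k - 2, 0), width - 1)
                acc += weight * img[r][cc]
            row.append(acc)
        out.append(row)
    return out

def transpose(img):
    return [list(col) for col in zip(*img)]

def downsample(img):
    """Separable blur (rows, then columns) followed by dropping every other pixel."""
    blurred = transpose(blur_rows(transpose(blur_rows(img))))
    return [row[::2] for row in blurred[::2]]

def gaussian_pyramid(img, levels):
    """Level 0 is the full-size image; each further level halves the resolution."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid

image = [[3.0] * 8 for _ in range(8)]
pyramid = gaussian_pyramid(image, 3)
coarse_to_fine = list(reversed(pyramid))     # order in which the levels would be visited
```

In a coarse-to-fine run, the surface estimated at one level would be subdivided and used to initialize the next finer level.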


TABLE I: Datasets used in our experiments

Dataset         Number of images   Resolution    Initialization   Time (min)
dino sparse     16                 640×480       visual hull      90
dino ring       48                 640×480       visual hull      150
temple sparse   16                 640×480       PMVS+PSR         105
temple ring     47                 640×480       PMVS+PSR         170
Beethoven       33                 1024×768      visual hull      180
bird            21                 1024×768      visual hull      160
fountain-P11    11                 3072×2048     PMVS+PSR         210
Herzjesu-P8     8                  3072×2048     PMVS+PSR         150
Totoro          8                  1504×1004     PMVS+PSR         45
Buddha          5                  2400×1800     PMVS+PSR         30
bell            3                  1504×1004     PMVS+PSR         20
statuegirl      50                 2592×3888     PMVS+PSR         750


Each entry gives accuracy [mm] / completeness [%]; "-" marks an entry that is not listed.

                 Temple                                   Dino
Method           Full (312)   Ring (47)    Sparse (16)   Full (363)   Ring (48)    Sparse (16)
DCV              -            0.73/98.2    0.66/97.3     -            0.28/100     0.3/100
Kostrikov        -            0.57/99.1    0.79/95.8     -            0.35/99.6    0.37/99.3
Furukawa 2       0.54/99.3    0.55/99.1    0.62/99.2     0.32/99.9    0.33/99.6    0.42/99.2
Zaharescu        -            0.55/99.2    0.78/95.8     -            0.42/98.6    0.45/99.2
Furukawa 3       0.49/99.6    0.47/99.6    0.63/99.3     0.33/99.8    0.28/99.8    0.37/99.2
ECCV2014 1338    -            -            0.62/95.2     -            -            0.37/99.2
Tsinghua BBNC    -            -            -             -            -            0.3/99.1
Liu2             -            -            0.65/96.9     -            -            0.51/98.7
Kolev3           -            0.7/98.3     0.97/92.7     -            0.42/99.5    0.48/98.6
Schroers         0.57/99.1    0.64/96.4    2.12/62.9     0.33/99.7    0.33/99.7    0.54/98.6
Hernandez        0.36/99.7    0.52/99.5    0.75/95.3     0.49/99.6    0.45/97.9    0.6/98.5
ZhaoxinLi        -            0.66/98.0    0.68/94.7     -            0.39/99.1    0.45/98.?
Hongxing         0.83/95.7    0.79/96.3    0.97/93.9     0.62/96.3    0.5/99.1     0.52/98.4
Liu              -            -            0.96/89.6     -            -            0.59/98.3
Kolev2           -            0.72/97.8    1.04/91.8     -            0.43/99.4    0.53/98.3


Fig. 6: Evaluation results of DCV on the Middlebury Benchmark. 


B. Middlebury Datasets

We first evaluate the effectiveness of DCV on the Middlebury
benchmark [1] using two performance indicators:
accuracy and completeness. Accuracy is measured by
the distance d such that the distance between 90% of the
reconstructed surface and the ground truth surface is less than
d. Completeness is measured by the percentage of the ground
truth surface whose distance to the reconstructed surface is
less than 1.25 mm.

We test DCV on the dino ring (48 views), dino sparse ring
(16 views), temple ring (47 views) and temple sparse ring
(16 views) datasets. The smaller the number of views,
the more challenging the reconstruction. The
accuracy and completeness of DCV on these four datasets are
shown in Fig. 6. These results are also publicly available on the
Middlebury evaluation page [50] and can be compared with
the state of the art. It is worth mentioning that (at the time
this paper was submitted) our DCV method achieves the best
result on dino ring and dino sparse ring in terms of both
model completeness and accuracy. Although our results on the
temple ring and temple sparse ring datasets are not top ranked,
the visual quality of our results is
better than that of most of the other top-ranked methods. In Fig. 7,
we compare the reconstruction results of DCV and several
top-ranked methods on the temple ring dataset. The reconstruction
results of DCV on dino sparse ring and temple sparse ring are
shown in Fig. 8, including the comparison with their coarse
initializations, which indicates that the proposed method is not
sensitive to initialization.

The quantitative comparison between DCV and
several state-of-the-art methods [4], [25], [26], [29], [31], [35],
[36], [43] is listed in Table II. Since some reconstruction
results were not reported by the authors, we label them as "-".
We also conduct a visual comparison with four representative
methods [26], [29], [36], [43] in Fig. 9 on the dino sparse ring
dataset. The methods proposed in [29], [36] use the isotropic
similarity measure for reprojection error minimization and use
isotropic mesh smoothing. The method proposed in [26] uses an
anisotropic weighted minimal surface functional. The method of
[43] is a combination of a patch-based method and isotropic
surface refinement.
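
The accuracy and completeness indicators defined at the start of this subsection can be sketched on point samples as follows. This is an approximation under stated assumptions, not the benchmark's evaluation code: both surfaces are represented by sampled points, distances are brute-force nearest-point distances, and the function names are hypothetical.

```python
import math

def nearest_distances(src, dst):
    """Distance from every point in src to its nearest point in dst (brute force)."""
    return [min(math.dist(p, q) for q in dst) for p in src]

def accuracy(dists, percent=90):
    """Smallest d such that `percent`% of the distances are <= d."""
    ordered = sorted(dists)
    k = -(-percent * len(ordered) // 100)        # ceil(percent * n / 100)
    return ordered[max(k, 1) - 1]

def completeness(dists, threshold=1.25):
    """Percentage of distances below `threshold` (same unit as the distances)."""
    return 100.0 * sum(d < threshold for d in dists) / len(dists)

# Accuracy uses reconstruction-to-ground-truth distances,
# completeness uses ground-truth-to-reconstruction distances.
recon_to_gt = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
gt_to_recon = [0.5, 1.0, 1.25, 2.0]
acc = accuracy(recon_to_gt)
comp = completeness(gt_to_recon)
```

The two indicators are deliberately asymmetric: a reconstruction covering only a small, well-matched patch can score well on accuracy and poorly on completeness.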


C. Results on the Other Datasets 

We further apply DCV to several other public datasets: the
Beethoven dataset [25], the bird dataset [26], the fountain-P11
dataset [2], and the statuegirl dataset [64]. We also conduct
experiments on two real datasets collected by us: Buddha and
bell. Since ground truths of these datasets are not available¹,
the reconstruction results are evaluated qualitatively
in these experiments.

The Beethoven dataset and the bird dataset contain thirty-three
1024 × 768 images and twenty-one 1024 × 768 images,
respectively. They were captured by a set of synchronized
cameras. The Beethoven dataset presents a textureless, smooth
surface, while the bird dataset presents a highly textured surface.
The reconstruction results on Beethoven by several state-of-the-art
methods [29], [31], [36] and the proposed DCV are shown
in Fig. 10. Thanks to the content-aware $L_p$-minimization based
denoising algorithm, DCV is able to effectively suppress noise
and outliers while keeping the sharp features of the surface. The
results on the bird dataset by these methods are shown in
Fig. 11. If the isotropic similarity measure is used, small
protrusions such as the wings, feather details, claws and
head of the bird are treated as high-frequency noise and
are thus over-smoothed. With the proposed detail-preserving
similarity measure, the fine-scale details are well preserved.

The Buddha dataset consists of five 2400 × 1800 images,
and it has been used to validate the proposed detail-preserving
similarity measure in Fig. 3. Fig. 3(d) is one sample image
from the dataset. Fig. 3(e) shows the reconstruction result obtained
with the isotropic ZNCC similarity measure, and Fig. 3(f)
shows the result obtained with the proposed detail-preserving
similarity measure. Note the preserved features on the
finger (upper-left close-up image) and the wrinkles on the clothes
(lower-right close-up image) of the Buddha statue. The results
on the bell dataset are shown in Fig. 12. The bell
dataset consists of only three 1504 × 1004 images of a bell
in a museum, and thus only a partial surface of the bell is
reconstructed. The small number of observed images makes
the regularization scheme more important. One of the observed
images is shown in Fig. 12(a). The initial surface of the bell
is estimated by PMVS+PSR. The point cloud generated by
PMVS and the watertight surface generated by PSR are shown in
Fig. 12(f) and Fig. 12(b), respectively. Fig. 12(c), Fig. 12(d)
and Fig. 12(e) show the reconstruction results of "isotropic
similarity + isotropic regularization", "detail-preserving
similarity + isotropic regularization", and DCV, respectively. The
chosen isotropic regularization combines the first-order and
second-order Laplacians [43]. Fig. 12(h) and Fig. 12(j) show
the close-up images of the corresponding results in Fig. 12(b)
and Fig. 12(e). DCV presents the best results for
both fine details and surface smoothness among all competing
methods.

¹ Although there was an evaluation system for quantitatively evaluating the
fountain-P11 dataset, the evaluation service is currently unavailable.

TABLE II: Quantitative comparison between DCV and several state-of-the-art methods on the Middlebury datasets in terms of
accuracy/completeness

Method           dino sparse     dino ring       temple sparse   temple ring
DCV              0.3mm/100%      0.28mm/100%     0.66mm/97.3%    0.73mm/98.2%
Vu [31]          -               0.53mm/99.7%    -               0.45mm/99.8%
Kostrikov [4]    0.37mm/99.3%    0.35mm/99.6%    0.79mm/95.8%    0.57mm/99.1%
Zaharescu [29]   0.45mm/99.2%    0.42mm/98.6%    0.78mm/95.8%    0.55mm/99.2%
Kolev2 [25]      0.53mm/98.3%    0.43mm/99.4%    1.04mm/91.8%    0.72mm/97.8%
Kolev3 [26]      0.48mm/98.6%    0.42mm/99.5%    0.97mm/92.7%    0.7mm/98.3%
Gargallo [35]    0.76mm/90.7%    0.6mm/92.9%     1.05mm/81.9%    0.88mm/84.3%
Delaunoy [36]    0.89mm/93.9%    -               0.73mm/95.9%    -
Furukawa3 [43]   0.37mm/99.2%    0.28mm/99.8%    0.63mm/99.3%    0.47mm/99.6%

Fig. 7: Comparison of reconstruction results of DCV on the Middlebury temple ring dataset. The names of the comparison methods follow
the entries on the Middlebury evaluation website. (a) Vu [31], Acc. 0.45, Comp. 99.8%. (b) Campbell [62], Acc. 0.48, Comp. 99.4%. (c)
Furukawa3 [43], Acc. 0.47, Comp. 99.6%. (d) Hernandez [19], Acc. 0.52, Comp. 99.5%. (e) The proposed DCV method, Acc. 0.73, Comp.
98.2%. (f) Ground truth.

Fig. 8: Reconstruction results of DCV on dino sparse ring (first row) and temple sparse ring (second row). (a) Some samples of dino sparse
ring; (b) and (d) are the visual hulls; (c) and (e) are the reconstruction results of DCV. (f) Some samples of temple sparse ring; (g) and (i)
are the reconstruction results of PMVS+PSR, which are adopted as the initial surfaces for DCV; (h) and (j) are the reconstruction results of
DCV.

Fig. 9: Comparison of reconstruction results on the dino sparse dataset. From left to right: results by [36], [26], [29], [43], the proposed
DCV, and the ground truth, respectively. DCV performs better in preserving the details and sharp features while filtering
the noise.

Fig. 10: Reconstruction results by several state-of-the-art methods and the proposed DCV on the Beethoven dataset. From left column to
right column: input images, results by [29], [36], [31] and DCV in two views, respectively.


The fountain-P11 dataset is an outdoor dataset including eleven
3072 × 2048 images. The results on the fountain-P11 dataset are
shown in Fig. 13. The input images, the initial surface generated
by PMVS+PSR and the reconstruction results of DCV are shown
in Fig. 13(a). The comparison of DCV with the
isotropic method is shown in Fig. 13(b). DCV performs
better than the isotropic method in
preserving the fine-scale details and sharp features.


The statuegirl dataset is an outdoor dataset including fifty 2592 ×
3888 images. The input images, the initial surface generated by
PMVS+PSR, and the reconstruction results by DCV and the
commercial 3D reconstruction software Smart3Dcapture (free
edition) [64] are shown in the first row of Fig. 14. The
close-up images of different surface regions are shown in
the second and third rows. The images were downsampled
by half before performing the reconstructions. The
Smart3Dcapture software integrates a complete and robust
reconstruction pipeline, including camera calibration,
dense reconstruction and visualization. We used the
software's ultra high precision option to recover more details.
For DCV, Bundler is used for calibration and PMVS+PSR is
used for initialization. The comparison shows that DCV
generally obtains results similar to those of Smart3Dcapture, and in
some parts (e.g., the toes) it recovers more fine-scale details.




Fig. 13: (a) Results of DCV on the fountain-P11 dataset. From left to
right: several input images, initial surface, results obtained by DCV.
(b) The left image shows the result obtained by the method based on
the isotropic similarity measure and surface regularization, and the right
image shows the result obtained by DCV. DCV performs
better in preserving the small-scale details and sharp features.

D. Evaluation on Content-aware Mesh Denoising 

Our DCV method consists of two components, i.e., the
detail-preserving similarity measure and the content-aware $L_p$ mesh
denoising. The effectiveness of the former component has been
validated in Fig. 3 by comparison with the isotropic ZNCC
measure. To evaluate the effectiveness of the latter component, we
implement four variants of DCV by substituting the
content-aware $L_p$ mesh denoising with four competing denoising
methods: the isotropic mesh smoothing method (i.e.,
the one based on the combination of first-order and second-order
Laplacians [43]) and three anisotropic mesh denoising
methods (i.e., two-step normal filtering [40], bilateral normal
filtering [41] and $L_0$ mesh denoising [42]). Two datasets, i.e.,
Herzjesu-P8 and Totoro, are used for evaluating the mesh
denoising methods.

The Herzjesu-P8 dataset contains eight 3072 × 2048 images
of a building with very sparse sharp edges (porch and stairs)
and many flat regions (wall). The Totoro dataset contains eight
1504 × 1004 images of a plastic statue with many fine-scale
details (grass on the ground, whiskers, fur and corn). As shown
in Fig. 15, on the Herzjesu-P8 dataset, the anisotropic methods
perform better than the isotropic method, and our
content-aware $L_p$ denoising method achieves results similar to
He et al.'s $L_0$ denoising method but visually more pleasant.
As shown in Fig. 16, on the Totoro dataset, our content-aware
$L_p$ denoising method obtains much better results than the
other methods. Unlike the competing denoising methods, the
proposed $L_p$ denoising method is content-aware and is able
to reconstruct objects with flat regions, sharp edges, and
fine-scale details.

Fig. 11: Reconstruction results by state-of-the-art methods and the proposed DCV on the bird dataset. From left column to right column:
input images, results by [29], [36], [31] and DCV in two views, respectively.

Fig. 12: Reconstruction results on the bell dataset. (a) One of the input images. (b) The initial reconstruction using PMVS+PSR. The
point cloud generated by PMVS is shown in (f). (c) Reconstruction result by isotropic similarity measure + isotropic smoothing. (d)
Reconstruction result by detail-preserving similarity measure + isotropic smoothing. (e) Reconstruction result by DCV. (g)-(j) are the
close-up images corresponding to the red rectangle regions in (b)-(e), respectively. One can see that DCV preserves the fine details while
keeping the surface smooth.

VI. Conclusion 

In this paper, we proposed a detail-preserving and content-aware
variational (DCV) method for multi-view stereo (MVS)
reconstruction. First, by connecting guided image filtering with
image registration, a novel similarity measure was proposed
to preserve the fine-scale details in reconstruction. Second, by
modeling surface gradients with a hyper-Laplacian, a content-aware
mesh denoising method based on $L_p$ minimization was
presented to suppress noise and outliers while preserving
sharp features. Compared with state-of-the-art MVS methods,
the proposed DCV method is capable of reconstructing a
smooth and clean surface with finely preserved details and
sharp features. The running time of our single-thread CPU
implementation of DCV on the datasets used in the
paper ranges from twenty minutes to several hours. In the future,
a GPU-based parallel implementation of the main parts of the
gradient computation will be adopted to improve the speed
of DCV by using the Nvidia CUDA framework.

Fig. 14: Results on the statuegirl dataset. First row, from left to right: one of the input images, initial surface by PMVS+PSR, results by DCV
and results by Smart3Dcapture Free Edition with the ultra high precision setting. Second and third rows, from left to right: the close-ups of
the reconstruction results in the first row.

Fig. 15: Results by using different mesh denoising methods on the Herzjesu-P8 dataset. (a) Input images. (b) Results by the combination
of first-order and second-order Laplacians [29], [43]. (c) Results by Sun et al.'s method [40]. (d) Results by Zhang et al.'s bilateral normal
filtering [41]. (e) Results by He et al.'s $L_0$ denoising [42]. (f) Results by our $L_p$ denoising method. All the models are flat-shaded to show
the faceting effect.